Until recently, the single most critical aspect of HPEC systems has been
"performance", in terms of processor speeds and I/O throughput. As processor speeds and I/O
throughput have continued to increase, and as the capability to build larger and larger systems has
improved, the need for raw performance is becoming less critical. Now, the ability to
achieve a high level of application availability is becoming as critical as performance.

In this paper, we present a CORBA-based framework upon which highly available
applications can be constructed. This framework, known as the Health Maintenance System,
provides the application, system managers, and management tools with the ability to "manage"
all resources within a system such that the "health" of the system can be maintained. The
management of these resources involves the ability to "sense" the state of the resource, to control
the resource, and to run tests on the resource in order to proactively detect any latent problems.

The primary facet of the framework is the "resource manager". The resource managers provide
local management support for all system resources. In addition, the resource managers provide
management access to clients, e.g., the application. This access is provided via a set of "client
interface" modules that provide a wide variety of interfaces, e.g., APIs, agents, etc. It is this
combination of resource managers and client interface modules that allows the framework to be
easily configured for a specific HPEC system.

• Goal: Increase Mean Time To Failure

• Classes

- Dual Redundancy (Hot Fail Over)
- Triple Redundancy (Result Comparison)

• Redundancy at System/Component Level

• Drawbacks:

- High Costs
- Low Density
- Additional Complexity
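
The "result comparison" used in triple redundancy can be illustrated with a majority vote over three replicas. The following is only a minimal sketch: the function names (`vote`, `run_triple_redundant`) and the two example replicas are hypothetical and do not come from the slides, and results are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority.
    Assumption: results are hashable and comparable for equality.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


def run_triple_redundant(replicas, *args):
    """Run the same computation on each replica and compare the results."""
    results = [replica(*args) for replica in replicas]
    return vote(results)


# Hypothetical replicas: two healthy units and one faulty unit.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


masked = run_triple_redundant([healthy, faulty, healthy], 4)
```

With two healthy replicas the single faulty result is outvoted, which is the benefit redundancy buys; the drawbacks listed above (cost, density, complexity) come from having to run every computation three times.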

Two Basic Tenets:

• Failure Rates of Both Software and Hardware are Non-Negligible
and Increasing

• Systems Cannot be Completely Modeled for Reliability Analysis
(thus their failure modes cannot be predicted in advance)


Goal:

• Decrease Mean Time to Repair
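
Why repair time matters can be seen from the standard steady-state availability relation, availability = MTTF / (MTTF + MTTR). The sketch below applies that relation to made-up numbers; the hour values are illustrative assumptions only and are not measurements from any SKY system.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative, hypothetical figures (not taken from the slides).
baseline = availability(1000.0, 10.0)       # starting point
longer_mttf = availability(2000.0, 10.0)    # failures half as frequent
faster_repair = availability(1000.0, 1.0)   # repairs ten times faster
```

In this example, cutting the repair time raises availability more than doubling the time to failure does, which is consistent with the goal stated above.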


ROC Mechanisms:

• Detection (Sensing and Diagnostics)

• Isolation

• Use of Excess Capacity (if available)

• Repair/Recovery
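
The four mechanisms listed above can be read as one recovery cycle. The following is a minimal sketch under stated assumptions: the `Node` class, its fields, and all function names are hypothetical, and "repair" is reduced to flipping a flag where a real system would reset or replace the component.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True
    in_service: bool = True


def detect(nodes):
    """Detection: find in-service nodes whose sensed state is unhealthy."""
    return [n for n in nodes if n.in_service and not n.healthy]


def isolate(node):
    """Isolation: take the failed node out of service."""
    node.in_service = False


def fail_over(spares):
    """Use of excess capacity: activate a healthy spare, if one exists."""
    for spare in spares:
        if spare.healthy and not spare.in_service:
            spare.in_service = True
            return spare
    return None


def repair(node):
    """Repair/recovery: stand-in for a reset of the failed node."""
    node.healthy = True


def recovery_cycle(active, spares):
    """Run detection, isolation, fail-over, and repair once."""
    for failed in detect(active):
        isolate(failed)
        active.remove(failed)
        replacement = fail_over(spares)
        if replacement is not None:
            spares.remove(replacement)
            active.append(replacement)
        repair(failed)
        spares.append(failed)  # the repaired node becomes excess capacity
    return active, spares


active = [Node("a"), Node("b", healthy=False)]
spares = [Node("s", in_service=False)]
recovery_cycle(active, spares)
```

After one cycle the failed node `b` has been replaced by the spare `s`, and `b` itself rejoins the pool as a repaired spare.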

Hardware:

• Quality Components

• Built-in Sensing of all Major Components

• Control of all Major Components (reset, etc.)

• Excess Capacity (where possible)

OS/Middleware:

• Quality Components

• Built-in Sensing of all Major Components

• Control of all Major Components


Application:

• Quality Components

• Built-in Sensing of all Major Components

• Control of all Major Components

• Overall System Management (Sensing and Control)

• HAA Support Blade

- Tini Management Processor (Java Processor)
- I2C Integration
- TCP/IP External Access

• Compute/IO Blades

- Out-of-band Management Controller
- Temperature Monitoring
- Voltage Monitoring
- Heart Beat Monitor
- Power Control/Reset
- I2C Integration

• Chassis

- Fan Monitoring
- Voltage Monitoring
- Power Control/Reset
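
A heart beat monitor of the kind listed for the compute/IO blades can be sketched as a table of last-seen timestamps checked against a timeout. This is an assumption-laden illustration, not the management controller's actual logic: the class name, method names, blade names, and timeout value are all hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each blade (hypothetical sketch)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, blade, now):
        """Record a heartbeat from a blade at time `now` (seconds)."""
        self.last_seen[blade] = now

    def missing(self, now):
        """Return blades whose last heartbeat is older than the timeout."""
        return sorted(
            blade
            for blade, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.beat("blade-0", now=0.0)
monitor.beat("blade-1", now=0.0)
monitor.beat("blade-0", now=4.0)   # blade-1 stays silent
late = monitor.missing(now=6.0)
```

A blade reported by `missing` would be a candidate for the power control/reset action listed alongside the monitor.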



GOALS:

• Provide Capability to Instrument OS, Middleware, and Application
(analogous to hardware instrumentation)

• Provide Uniform View of Entire System
(hardware, OS, middleware, and application)

• Provide Integrated Diagnostics

• Provide Access Using Standard Interfaces

• Minimal Performance Impact

• Easily Extensible and Configurable
(in order to meet individual application requirements)

• Server Objects: Sensors, Controllers, and Timers

- Embedded within the hardware, OS, middleware, and application
- Combined into a Resource Object

• Clients: Application, Management Tools, and Users

• Communication: Event Driven, Request Driven, and Timer Driven Messaging

• Lookup Services

• Extensible

- Can support an arbitrary number of servers and clients
- Application developers can add application-specific servers

• Configurable

- Which servers and clients are to run
- When and where they are to run
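
The relationship between sensors, controllers, resource objects, lookup, and clients described above can be sketched in a few classes. This is a plain in-process Python illustration and not the CORBA interfaces of the Health Maintenance System, which the slides do not show; every class, method, and resource name here is a hypothetical stand-in. Timer-driven messaging is only hinted at: a timer would call `publish` periodically.

```python
class Sensor:
    """Server object that exposes one sensed value."""

    def __init__(self, name, read_fn):
        self.name = name
        self._read_fn = read_fn
        self._subscribers = []

    def read(self):
        """Request-driven access: a client asks for the current value."""
        return self._read_fn()

    def subscribe(self, callback):
        """Event-driven access: a client registers for notifications."""
        self._subscribers.append(callback)

    def publish(self):
        """Push the current value to subscribers (e.g. from a timer)."""
        value = self.read()
        for callback in self._subscribers:
            callback(self.name, value)
        return value


class Controller:
    """Server object that performs one control action."""

    def __init__(self, name, action_fn):
        self.name = name
        self._action_fn = action_fn

    def invoke(self, *args):
        return self._action_fn(*args)


class Resource:
    """Groups the sensors and controllers of one managed resource."""

    def __init__(self, name):
        self.name = name
        self.sensors = {}
        self.controllers = {}

    def add_sensor(self, sensor):
        self.sensors[sensor.name] = sensor

    def add_controller(self, controller):
        self.controllers[controller.name] = controller


class LookupService:
    """Lets clients find resources by name."""

    def __init__(self):
        self._resources = {}

    def register(self, resource):
        self._resources[resource.name] = resource

    def find(self, name):
        return self._resources[name]


# Hypothetical blade with one sensor and one controller.
state = {"temp_c": 41.0, "resets": 0}
blade = Resource("blade-0")
blade.add_sensor(Sensor("temperature", lambda: state["temp_c"]))
blade.add_controller(
    Controller("reset", lambda: state.update(resets=state["resets"] + 1))
)

lookup = LookupService()
lookup.register(blade)

# A client finds the resource, subscribes to events, and issues a reset.
events = []
found = lookup.find("blade-0")
found.sensors["temperature"].subscribe(lambda n, v: events.append((n, v)))
found.sensors["temperature"].publish()
found.controllers["reset"].invoke()
```

Because a resource is just a named collection of server objects, adding an application-specific sensor amounts to registering one more object, which is the extensibility property listed above.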

Example HMS Based System

[Figure: diagram labels include Application, System Mgr (person), and System Mgmt Tools.]

• Used to Monitor Resource Usage (Development and Runtime)

- Hardware (temperature, voltage, etc.)
- OS/Middleware (processor load, data throughput, etc.)
- Application (queue lengths, wait times, etc.)

• Used to Manage These Resources

• Used to Detect and Isolate Faults

• Used to Predict Possible Future Faults

• Used to Gather Statistics on Resource Usage and Performance

• Used to Determine the Health of Resources (Diagnostics)
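
Three of the uses listed above (gathering statistics, detecting a fault, and predicting a future fault) can be sketched on a short series of sensor samples. The function names, the temperature readings, and the limit are hypothetical, and the prediction is a deliberately simple linear extrapolation from the first to the last sample, not a method described in the slides.

```python
import math
from statistics import mean


def summarize(samples):
    """Gather basic statistics on a series of sensor samples."""
    return {"min": min(samples), "max": max(samples), "mean": mean(samples)}


def over_limit(samples, limit):
    """Detect a fault: the most recent sample exceeds the limit."""
    return samples[-1] > limit


def samples_until_limit(samples, limit):
    """Predict how many more samples until the limit is reached.

    Uses a straight line through the first and last sample.
    Returns None when the series is too short or not rising.
    """
    if len(samples) < 2:
        return None
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    if slope <= 0:
        return None
    remaining = limit - samples[-1]
    if remaining <= 0:
        return 0
    return math.ceil(remaining / slope)


# Hypothetical temperature readings in degrees Celsius.
temps = [40.0, 42.0, 44.0, 46.0]
stats = summarize(temps)
fault_now = over_limit(temps, 50.0)
steps_left = samples_until_limit(temps, 50.0)
```

Here no fault is present yet, but the rising trend suggests the limit would be reached within two more samples, which gives a client time to act before the fault occurs.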

Future Directions

• Tight Integration with SKY Analysis Tools

• Tight Integration with SKY Development Tools

• Pattern-based Application Recovery Libraries

• Dynamic Insertion of Sensors/Controllers (Dynamic Probes)

• Support for Other Hardware Environments (Hot-Swap)


©  SKY  Computers,  Inc.  Company  Confidential.  All  Rights  Reserved  08/27/03  Slide  1 


